{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Graph Pipeline Demo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DataProfiler can also load and profile graph datasets. Similarly to the rest of DataProfiler profilers, this is split into two components:\n",
    "- GraphData\n",
    "- GraphProfiler\n",
    "\n",
    "We will demo the use of this graph pipeline.\n",
    "\n",
    "First, let's import the libraries needed for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import pprint\n",
    "\n",
    "try:\n",
    "    sys.path.insert(0, '..')\n",
    "    import dataprofiler as dp\n",
    "except ImportError:\n",
    "    import dataprofiler as dp\n",
    "\n",
    "data_path = \"../dataprofiler/tests/data\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now input our dataset into the generic DataProfiler pipeline:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = dp.Data(os.path.join(data_path, \"csv/graph_data_csv_identify.csv\"))\n",
    "profile = dp.Profiler(data)\n",
    "\n",
    "report = profile.report()\n",
    "\n",
    "pp = pprint.PrettyPrinter(sort_dicts=False, compact=True)\n",
    "pp.pprint(report)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We notice that the `Data` class automatically detected the input file as graph data. The `GraphData` class is able to differentiate between tabular and graph csv data. After `Data` matches the input file as graph data, `GraphData` does the necessary work to load the csv data into a NetworkX Graph. \n",
    "\n",
    "`Profiler` runs `GraphProfiler` when graph data is input (or when `data_type=\"graph\"` is specified). The `report()` function outputs the profile for the user."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Profile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The profile skeleton looks like this:\n",
    "```\n",
    "profile = {\n",
    "    \"num_nodes\": ...,\n",
    "    \"num_edges\": ...,\n",
    "    \"categorical_attributes\": ...,\n",
    "    \"continuous_attributes\": ...,\n",
    "    \"avg_node_degree\": ...,\n",
    "    \"global_max_component_size\": ...,\n",
    "    \"continuous_distribution\": ...,\n",
    "    \"categorical_distribution\": ...,\n",
    "    \"times\": ...,\n",
    "}\n",
    "```\n",
    "\n",
    "Description of properties in profile:\n",
    "- `num_nodes`: number of nodes in the graph\n",
    "- `num_edges`: number of edges in the graph\n",
    "- `categorical_attributes`: list of categorical edge attributes\n",
    "- `continuous_attributes`: list of continuous edge attributes\n",
    "- `avg_node_degree`: average degree of nodes in the graph\n",
    "- `global_max_component_size`: size of largest global max component in the graph\n",
    "- `continuous_distribution`: dictionary of statistical properties for each continuous attribute\n",
    "- `categorical_distribution`: dictionary of statistical properties for each categorical attribute\n",
    "\n",
    "The `continuous_distribution` and `categorical_distribution` dictionaries list statistical properties for each edge attribute in the graph:\n",
    "```\n",
    "continuous_distribution = {\n",
    "    \"name\": ...,\n",
    "    \"scale\": ...,\n",
    "    \"properties\": ...,\n",
    "}\n",
    "```\n",
    "```\n",
    "categorical_distribution = {\n",
    "    \"bin_counts\": ...,\n",
    "    \"bin_edges\": ...,\n",
    "}\n",
    "```\n",
    "Description of each attribute:\n",
    "- Continuous distribution:\n",
    "    - `name`: name of the distribution\n",
    "    - `scale`: negative log likelihood used to scale distributions and compare them in `GraphProfiler`\n",
    "    - `properties`: list of distribution props\n",
    "- Categorical distribution:\n",
    "    - `bin_counts`: histogram bin counts\n",
    "    - `bin_edges`: histogram bin edges\n",
    "\n",
    "`properties` lists the following distribution properties: [optional: shape, loc, scale, mean, variance, skew, kurtosis]. The list can be either 6 length or 7 length depending on the distribution (extra shape parameter):\n",
    "- 6 length: norm, uniform, expon, logistic\n",
    "- 7 length: gamma, lognorm\n",
    "    - gamma: shape=`a` (float)\n",
    "    - lognorm: shape=`s` (float)\n",
    "    \n",
    "For more information on shape parameters `a` and `s`: https://docs.scipy.org/doc/scipy/tutorial/stats.html#shape-parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving and Loading a Profile\n",
    "Below you will see an example of how a Graph Profile can be saved and loaded again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The default save filepath is profile-<datetime>.pkl\n",
    "profile.save(filepath=\"profile.pkl\")\n",
    "\n",
    "new_profile = dp.GraphProfiler.load(\"profile.pkl\")\n",
    "new_report = new_profile.report()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pp.pprint(report)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Difference in Data\n",
    "If we wanted to ensure that this new profile was the same as the previous profile that we loaded, we could compare them using the diff functionality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diff = profile.diff(new_profile)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pp.pprint(diff)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another use for diff might be to provide differences between training and testing profiles as shown in the cell below.\n",
    "We will use the profile above as the training profile and create a new profile to represent the testing profile"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_profile = profile\n",
    "\n",
    "testing_data = dp.Data(os.path.join(data_path, \"csv/graph-differentiator-input-positive.csv\"))\n",
    "testing_profile = dp.Profiler(testing_data)\n",
    "\n",
    "test_train_diff = training_profile.diff(testing_profile)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below you can observe the difference between the two profiles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pp.pprint(test_train_diff)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have shown the graph pipeline in the DataProfiler. It works similarly to the current DataProfiler implementation."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
